(57) ABSTRACT

A method of monitoring in a distributed computer network 
having a management server servicing a set of managed 
computers. The method begins by deploying a management 
infrastructure across a given subset of the managed 
computers, the management infrastructure comprising a 
runtime environment installed at a given managed computer. 
At the given managed computer, the routine executes a 
monitoring agent in the runtime environment to determine 
whether a given threshold has been exceeded. Then, a given 
action is taken if the given threshold has been exceeded. The 
monitoring agent is executed upon receipt of an external 
event or as a result of an internal timer. Execution of the 
monitoring agent involves taking a measurement, comparing 
the measurement against the given threshold, and then 
taking some corrective action if possible. 
SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR
MONITORING IN A DISTRIBUTED COMPUTING ENVIRONMENT

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention is directed to managing a large
distributed computer enterprise environment and, more
particularly, to correlating system and network events in a
system having distributed monitors that use events to convey
status changes in monitored objects.

2. Description of the Related Art

Companies now desire to place all of their computing
resources on the company network. To this end, it is known
to connect computers in a large, geographically-dispersed
network environment and to manage such an environment in
a distributed manner. One such management framework
comprises a server that manages a number of nodes, each of
which has a local object database that stores object data
specific to the local node. Each managed node ideally
includes a management framework, comprising a number of
management routines, that is capable of a relatively large
number (e.g., hundreds) of simultaneous network
connections to remote machines. As the number of managed nodes
increases, the system maintenance problems also increase,
as do the odds of a machine failure or other fault.

The problem is exacerbated in a typical enterprise as the
node number rises. Of these nodes, only a small percentage
are file servers, name servers, database servers, or anything
but end-of-wire or "endpoint" machines. The majority of the
network machines are simple personal computers ("PC's")
or workstations that see little management activity during a
normal day.

Thus, as the size of the distributed computing
environment increases, it becomes more difficult to centrally
monitor system and network events that convey status changes in
various monitored objects (e.g., nodes, systems, computers,
subsystems, devices and the like). In the prior art, it is known
to distribute event monitor devices across machines that are
being centrally managed. Such event monitors, however,
typically use a full-fledged inference engine to match event
data to given conditions sought to be monitored. An
"inference engine" is a software engine within an expert system
that draws conclusions from rules and situational facts.
Implementation of the event monitor in this fashion requires
significant local system resources (e.g., a large database),
which is undesirable. Indeed, as noted above, it is a design
goal to use only a lightweight management framework
within the endpoint machines being managed.

Prior art techniques have several other significant
disadvantages. One problem is lack of scalability. As the number
of connected nodes increases, it has not been possible for an
administrator to easily add monitoring capabilities to an
appropriate subset of the endpoints with minimal effort.
Even when the monitoring application can be configured, it
may not operate appropriately under peak conditions.
Another significant problem is that local monitors do not
have sufficient built-in response capability. In large
distributed systems, it is often insufficient to note merely that a
monitored value of a particular resource is out of tolerance.
Whenever possible, a local attempt to correct the situation
must be made. Known systems do not have adequate local
response capability. Moreover, some errors have no local
remedy and, in those cases, the response must have a
corresponding remote action that can be triggered by the
client error.

The prior art has not adequately addressed these and other
problems. Thus, there remains a need to provide more
efficient monitoring techniques within a distributed
computer environment wherein distributed monitors use events
to convey status changes in monitored objects within the
environment.

BRIEF SUMMARY OF THE INVENTION

It is a primary object of this invention to provide
monitoring of resources within a distributed
computing environment.

It is another primary object of this invention to implement
a distributed monitor runtime environment at given nodes in
a large distributed computer network to facilitate the task of
resource monitoring.

It is still another important object of the present invention
to provide a robust event-driven control mechanism for
correcting out-of-tolerance conditions identified with
respect to resources being monitored in a local network
system.

It is yet another object of the present invention to facilitate
[illegible] of monitoring capabilities to new endpoint
machines in a large computer network [illegible].

A more general object of this invention is to provide
resource monitoring across a distributed computer
environment.

These and other objects of the invention are provided in
a method of monitoring implemented within a distributed
environment having a management server and a set of
managed machines. A given subset of the managed
machines include a distributed management infrastructure.
In particular, each managed machine in the given subset
includes a runtime environment, which is a platform-level
service that can load and execute software agents. One or
more software agents are deployable within the distributed
environment to facilitate management and other control
tasks. The runtime environment at a particular node includes
a runtime engine, and a distributed monitor (DM) for
carrying out monitoring tasks.

A representative monitoring operation involves making a
measurement, comparing the measured value against
threshold(s), and performing a response for out-of-tolerance
conditions. According to the present invention, a monitoring
agent may be triggered to run via a timer or upon satisfaction
of a given correlation condition. An event correlator is used
to determine whether the given correlation condition has
been met.

In accordance with one aspect of the invention, there is
described a method of monitoring in a distributed computer
network having a management server servicing a set of
managed computers. The method begins by deploying a
management infrastructure across a given subset of the
managed computers, the management infrastructure
comprising a runtime environment installed at a given managed
computer. At the given managed computer, the routine
executes a monitoring agent in the runtime environment to
determine whether a given threshold has been exceeded.
Then, a given action is taken if the given threshold has been
exceeded. The monitoring agent is executed upon receipt of
an external event or as a result of an internal timer.
Execution of the monitoring agent involves taking a measurement,
comparing the measurement against the given threshold, and
then taking some corrective action if possible.

Another aspect of the present invention is a method of
monitoring in a distributed computer network having a set of
managed computers, wherein a management infrastructure 
is deployed across a given subset of the managed computers 
and comprises a runtime environment installed at a given 
managed computer. The method begins by establishing an 
event class registration list at a given managed computer. 
Upon receipt of an event having an event class associated 
therewith, the routine then examines the registration list to 
determine whether a given monitoring task has expressed 
interest in the event class. If so, the event is processed 
through a correlator. Then, a given action is taken (e.g., 
executing the given monitoring task) if a condition 
expressed in a correlation rule associated with the
monitoring task has been met. The given monitoring task may
include a response function to attempt to correct the condi- 
tion that triggered the task. 
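
By way of illustration only, the registration check described above might be sketched in Java as follows. Every name in this sketch (Event, Registration, RegistrationList, register, dispatch) is hypothetical and is not taken from the disclosure. The sketch assumes that a correlation rule can be modeled as a simple predicate over a single incoming event and that the "given action" is running the registered monitoring task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical event carrying an event class and a numeric payload.
final class Event {
    final String eventClass;
    final double value;

    Event(String eventClass, double value) {
        this.eventClass = eventClass;
        this.value = value;
    }
}

// Hypothetical registration entry: a correlation rule plus the task to run when it is met.
final class Registration {
    final Predicate<Event> correlationRule;
    final Runnable monitoringTask;

    Registration(Predicate<Event> correlationRule, Runnable monitoringTask) {
        this.correlationRule = correlationRule;
        this.monitoringTask = monitoringTask;
    }
}

// Illustrative event class registration list kept at a managed computer.
final class RegistrationList {
    private final Map<String, List<Registration>> byClass = new HashMap<>();

    // A monitoring task expresses interest in an event class.
    void register(String eventClass, Predicate<Event> rule, Runnable task) {
        byClass.computeIfAbsent(eventClass, k -> new ArrayList<>())
               .add(new Registration(rule, task));
    }

    // Returns how many tasks ran for this event; 0 means no interest or no rule met.
    int dispatch(Event event) {
        List<Registration> interested = byClass.get(event.eventClass);
        if (interested == null) {
            return 0; // no task registered for this event class
        }
        int ran = 0;
        for (Registration r : interested) {
            if (r.correlationRule.test(event)) { // correlation condition met
                r.monitoringTask.run();          // the "given action"
                ran++;
            }
        }
        return ran;
    }
}
```

In this sketch an event whose class has no registration is simply ignored; how such an event would be handled elsewhere is outside the scope of the example.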

Another aspect of this invention is a monitor system for 
use in a managed machine connected in a distributed com- 
puter network. The monitor system comprises a runtime 
engine, and an event correlator/router executable in the 
runtime engine and responsive to an event stream to deter- 
mine whether a set of one or more events satisfying a given 
correlation condition have been received. At least one moni- 
tor task is also executable in the runtime engine upon 
satisfaction of the given correlation condition to effect 
monitoring of a managed local resource. The monitor task 
may also implement a correction task using the runtime 
engine or other local resources. 
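
By way of illustration only, the relationship between a correlation condition and a monitor task might be sketched in Java as follows. The class names (CountingCorrelator, MonitorTask) and the particular condition used (a fixed number of events of one class) are assumptions made for this example; the disclosure does not prescribe them. The monitor task follows the measure, compare and respond pattern, with the correction supplied as an ordinary callback.

```java
import java.util.function.DoubleSupplier;

// Hypothetical correlation condition: satisfied once `required` matching events have arrived.
final class CountingCorrelator {
    private final String eventClass;
    private final int required;
    private int seen = 0;

    CountingCorrelator(String eventClass, int required) {
        this.eventClass = eventClass;
        this.required = required;
    }

    // Feed one event from the stream; returns true when the condition is satisfied.
    boolean accept(String incomingClass) {
        if (!eventClass.equals(incomingClass)) {
            return false; // events of other classes do not count toward the condition
        }
        seen++;
        if (seen >= required) {
            seen = 0; // reset so the condition can be satisfied again later
            return true;
        }
        return false;
    }
}

// Hypothetical monitor task: measure, compare against a threshold, attempt a correction.
final class MonitorTask {
    private final DoubleSupplier measurement;
    private final double threshold;
    private final Runnable correction;

    MonitorTask(DoubleSupplier measurement, double threshold, Runnable correction) {
        this.measurement = measurement;
        this.threshold = threshold;
        this.correction = correction;
    }

    // Returns true if the measured value exceeded the threshold and a correction was attempted.
    boolean run() {
        double value = measurement.getAsDouble();
        if (value > threshold) {
            correction.run();
            return true;
        }
        return false;
    }
}
```

A caller would feed the event stream to the correlator and invoke MonitorTask.run() each time accept() returns true.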

The foregoing has outlined some of the more pertinent 
objects of the present invention. These objects should be 
construed to be merely illustrative of some of the more 
prominent features and applications of the invention. Many 
other beneficial results can be attained by applying the 
disclosed invention in a different manner or modifying the 
invention as will be described. Accordingly, other objects 
and a fuller understanding of the invention may be had by 
referring to the following Detailed Description of the pre- 
ferred embodiment. 

BRIEF DESCRIPTION OF THE DRAWINGS 

For a more complete understanding of the present inven- 
tion and the advantages thereof, reference should be made to 
the following Detailed Description taken in connection with 
the accompanying drawings in which: 

FIG. 1 illustrates a simplified diagram showing a large 
distributed computing enterprise environment in which the 
present invention is implemented; 

FIG. 2 is a block diagram of a preferred system manage- 
ment framework illustrating how the framework function- 
ality is distributed across the gateway and its endpoints 
within a managed region; 

FIG. 2A is a block diagram of the elements that comprise
the LCF client component of the system management frame- 
work; 

FIG. 3 illustrates a smaller "workgroup" implementation 
of the enterprise in which the server and gateway functions 
are supported on the same machine; 

FIG. 4 is a distributed computer network environment 
having a management infrastructure for use in carrying out 
the preferred method of the present invention; 

FIG. 5 is a block diagram illustrating a preferred runtime 
environment located at a managed machine within the 
distributed computer network; 

FIG. 6 is a block diagram illustrating how a particular 
monitoring task or agent may be triggered; 

FIG. 7 is a block diagram of the event routing module of 
the distributed monitor; and 
FIG. 8 is a block diagram of an event correlator of the
distributed monitor of the present invention. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

Referring now to FIG. 1, the invention is preferably
implemented in a large distributed computer environment 10
comprising up to thousands of "nodes." The nodes will
typically be geographically dispersed and the overall
environment is "managed" in a distributed manner. Preferably,
the managed environment (ME) is logically broken down
into a series of loosely-connected managed regions (MR)
12, each with its own management server 14 for managing
local resources with the MR. The network typically will
include other servers (not shown) for carrying out other
distributed network functions. These include name servers,
security servers, file servers, threads servers, time servers
and the like. Multiple servers 14 coordinate activities across
the enterprise and permit remote site management and
operation. Each server 14 serves a number of gateway
machines 16, each of which in turn supports a plurality of
endpoints 18. The server 14 coordinates all activity within
the MR using a terminal node manager 20.

Referring now to FIG. 2, each gateway machine 16 runs
a server component 22 of a system management framework.
The server component 22 is a multi-threaded runtime
process that comprises several components: an object request
broker or "ORB" 21, an authorization service 23, object
location service 25 and basic object adaptor or "BOA" 27.
Server component 22 also includes an object library 29.
Preferably, the ORB 21 runs continuously, separate from the
operating system, and it communicates with both server and
client processes through separate stubs and skeletons via an
interprocess communication (IPC) facility 19. In particular,
a secure remote procedure call (RPC) is used to invoke
operations on remote objects. Gateway machine 16 also
includes an operating system 15 and a threads mechanism
17.

The system management framework includes a client
component 24 supported on each of the endpoint machines
18. The client component 24 is a low cost, low maintenance
application suite that is preferably "dataless" in the sense
that system management data is not cached or stored there
in a persistent manner. Implementation of the management
framework in this "client-server" manner has significant
advantages over the prior art, and it facilitates the
connectivity of personal computers into the managed environment.
Using an object-oriented approach, the system management
framework facilitates execution of system management
tasks required to manage the resources in the MR. Such
tasks are quite varied and include, without limitation, file
and data distribution, network usage monitoring, user
management, printer or other resource configuration
management, and the like.

In the large enterprise such as illustrated in FIG. 1,
preferably there is one server per MR with some number of
gateways. For a workgroup-size installation (e.g., a local
area network) such as illustrated in FIG. 3, a single
server-class machine may be used as the server and gateway, and
the client machines would run a low maintenance framework.
References herein to a distinct server and one or more
gateway(s) should thus not be taken by way of limitation as
these elements may be combined into a single platform. For
intermediate size installations the MR grows breadth-wise,
with additional gateways then being used to balance the load
of the endpoints.
The server is the top-level authority over all gateways and
endpoints. The server maintains an endpoint list, which
keeps track of every endpoint in a managed region. This list
preferably contains all information necessary to uniquely 
identify and manage endpoints including, without limitation, 
such information as name, location, and machine type. The 
server also maintains the mapping between endpoint and 
gateway, and this mapping is preferably dynamic. 
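
By way of illustration only, the endpoint list and the dynamic endpoint-to-gateway mapping might be sketched in Java as follows. The class and method names (EndpointInfo, EndpointRegistry, addEndpoint, reassign, gatewayFor) are hypothetical; only the identifying fields (name, location, machine type) come from the description above. The sketch assumes that endpoint names are unique within a managed region.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical endpoint entry holding the identifying fields named in the text.
final class EndpointInfo {
    final String name;
    final String location;
    final String machineType;

    EndpointInfo(String name, String location, String machineType) {
        this.name = name;
        this.location = location;
        this.machineType = machineType;
    }
}

// Illustrative server-side endpoint list with a dynamic endpoint-to-gateway mapping.
final class EndpointRegistry {
    private final Map<String, EndpointInfo> endpoints = new HashMap<>();
    private final Map<String, String> gatewayOf = new HashMap<>();

    // Record a new endpoint together with the gateway that currently serves it.
    void addEndpoint(EndpointInfo info, String gatewayName) {
        endpoints.put(info.name, info);
        gatewayOf.put(info.name, gatewayName);
    }

    // The mapping is dynamic: a known endpoint can be moved to a different gateway.
    void reassign(String endpointName, String newGateway) {
        if (!endpoints.containsKey(endpointName)) {
            throw new IllegalArgumentException("unknown endpoint: " + endpointName);
        }
        gatewayOf.put(endpointName, newGateway);
    }

    // Returns the serving gateway, or null if the endpoint is not in the list.
    String gatewayFor(String endpointName) {
        return gatewayOf.get(endpointName);
    }

    int size() {
        return endpoints.size();
    }
}
```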

As noted above, there are one or more gateways per 
managed region. Preferably, a gateway is a fully-managed 
node that has been configured to operate as a gateway. 
Initially, a gateway "knows" nothing about endpoints. As 
endpoints login, the gateway builds an endpoint list for its 
endpoints. The gateway's duties preferably include: listen- 
ing for endpoint login requests, listening for endpoint update 
requests, and (its main task) acting as a gateway for method 
invocations on endpoints. 

As also discussed above, the endpoint is a machine 
running the system management framework client 
component, which is referred to herein as the low cost 
framework (LCF). The LCF has two main parts as illustrated 
in FIG. 2A: the LCF daemon 24a and an application runtime 
library 24b. The LCF daemon 24a is responsible for endpoint
login and for spawning application endpoint
executables. Once an executable is spawned, the LCF dae- 
mon 24a has no further interaction with it. Each executable 
is linked with the application runtime library 24b, which 
handles all further communication with the gateway.

Preferably, the server and each of the gateways is a 
computer or "machine." For example, each computer may 
be a RISC System/6000® (a reduced instruction set or 
so-called RISC-based workstation) running the AIX 
(Advanced Interactive Executive) operating system,
preferably Version 3.2.5 or greater. Suitable alternative
machines include: an IBM-compatible PC x86 or higher 
running Novell UnixWare 2.0, an AT&T 3000 series running 
AT&T UNIX SVR4 MP-RAS Release 2.02 or greater, Data 
General AViiON series running DG/UX version 5.4R3.00 or 
greater, an HP9000/700 and 800 series running HP/UX 9.00 
through HP/UX 9.05, a Motorola 88K series running SVR4
version R40V4.2, a Sun SPARC series running Solaris 2.3 or 
2.4, or a Sun SPARC series running SunOS 4.1.2 or 4.1.3. 
Of course, other machines and/or operating systems may be 
used as well for the gateway and server machines. 

Each endpoint is also a computer. In one preferred 
embodiment of the invention, most of the endpoints are 
personal computers (e.g., desktop machines or laptops). In 
this architecture, the endpoints need not be high powered or 
complex machines or workstations. One or more of the 
endpoints may be a notebook computer, e.g., the IBM 
ThinkPad® machine, or some other Intel x86 or Pentium®- 
based computer running Windows '95 or greater operating 
system. IBM® or IBM-compatible machines running under
the OS/2® operating system may also be implemented as the 
endpoints. An endpoint computer preferably includes a 
browser, such as Netscape Navigator or Microsoft Internet 
Explorer, and may be connected to a gateway via the 
Internet, an intranet or some other computer network. 

Preferably, the client-class framework running on each 
endpoint is a low-maintenance, low-cost framework that is 
ready to do management tasks but consumes few machine 
resources (because it is normally in an idle state). Each 
endpoint may be "dataless" in the sense that system man- 
agement data is not stored therein before or after a particular 
system management task is implemented or carried out. 

This architecture advantageously enables a rational par- 
titioning of the enterprise with 10's of servers, 100's of 
gateway machines, and 1000's of endpoints. Each server
typically serves up to 200 gateways, each of which services
1000's of endpoints. At the framework level, all operations
to or from an endpoint may pass through a gateway machine.
In many operations, the gateway is transparent; it receives a
request, determines the targets, resends the requests, waits
for results, then returns results back to the caller. Each
gateway handles multiple simultaneous requests, and there
may be any number of gateways in an enterprise, with the
exact number depending on many factors including the
available resources and the number of endpoints that need to 
be serviced. 
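
By way of illustration only, the transparent gateway behavior (receive a request, resend it to the targets, wait for results, return them to the caller) might be sketched in Java as follows. The class GatewaySketch and its invokeOnEndpoint function are hypothetical stand-ins for the actual method invocation path, which the description does not detail. The sketch also assumes the caller supplies the list of targets; how a gateway determines targets is not shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Illustrative gateway that forwards a request to target endpoints and gathers the replies.
final class GatewaySketch {
    // Hypothetical transport: (endpoint name, request) -> reply.
    private final BiFunction<String, String, String> invokeOnEndpoint;

    GatewaySketch(BiFunction<String, String, String> invokeOnEndpoint) {
        this.invokeOnEndpoint = invokeOnEndpoint;
    }

    // Returns the replies keyed by endpoint name, in the order the targets were given.
    Map<String, String> handle(String request, List<String> targets) {
        // Resend the request to every target concurrently.
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        for (String endpoint : targets) {
            pending.put(endpoint,
                CompletableFuture.supplyAsync(() -> invokeOnEndpoint.apply(endpoint, request)));
        }
        // Wait for all results, then hand them back to the caller.
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<String>> e : pending.entrySet()) {
            results.put(e.getKey(), e.getValue().join());
        }
        return results;
    }
}
```

Because each call to handle() keeps its own map of pending futures, several requests can be in flight through the same gateway object at once, which loosely mirrors the statement that a gateway handles multiple simultaneous requests.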

As distributed systems such as described above grow in 
size and complexity, management becomes more difficult.
To facilitate system management, certain of the managed
machines may include a uniform "engine" that executes one 
or more tasks (e.g., software "agents") that have been 
distributed by a central mechanism. This architecture is 
illustrated in FIG. 4. 

In this embodiment, a set of "software agents" 37 are
available at a central location (e.g., manager 14) or at a
plurality of locations (e.g., the gateways 16) in the network
where administrative, configuration or other management
tasks are specified, configured and/or deployed. The
software agents are "mobile" in the sense that the agents are
dispatched from a dispatch mechanism 35 and then migrate
throughout the network environment. Generally, as will be
seen, the mobile software agents traverse the network to
perform or to facilitate various network and system
management tasks. Alternatively, dispatch mechanism 35 may
include a set of configurable software tasks 39 from which
one or more agents are constructed. Manager 14 preferably
also includes a database 43 including information
identifying a list of all machines in the distributed computing
environment that are designed to be managed. The dispatch
mechanism itself may be distributed across multiple nodes. 
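
By way of illustration only, constructing an agent from a set of configurable tasks might be sketched in Java as follows. The class DispatchSketch and its methods are hypothetical. The sketch assumes an agent is simply the named tasks run in order, and it uses an in-memory list as a stand-in for the database of managed machines; neither assumption is drawn from the disclosure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative dispatch mechanism that builds agents out of configurable tasks.
final class DispatchSketch {
    // Stand-in for the set of configurable software tasks.
    private final Map<String, Runnable> taskLibrary = new LinkedHashMap<>();
    // Stand-in for the database listing machines that are designed to be managed.
    private final List<String> managedMachines = new ArrayList<>();

    void defineTask(String name, Runnable task) {
        taskLibrary.put(name, task);
    }

    void addManagedMachine(String host) {
        managedMachines.add(host);
    }

    // Construct an agent that runs the named tasks in the order given.
    Runnable buildAgent(List<String> taskNames) {
        List<Runnable> steps = new ArrayList<>();
        for (String name : taskNames) {
            Runnable task = taskLibrary.get(name);
            if (task == null) {
                throw new IllegalArgumentException("unknown task: " + name);
            }
            steps.add(task);
        }
        return () -> {
            for (Runnable step : steps) {
                step.run();
            }
        };
    }

    // Only machines known to the database are treated as valid dispatch targets.
    boolean canDispatchTo(String host) {
        return managedMachines.contains(host);
    }
}
```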

At least some of the gateway nodes 16 and at least some
of the terminal nodes 18 (or some defined subset thereof)
include a runtime environment 41 that has been downloaded
to the particular node via a distribution service. The runtime
environment 41 includes a runtime engine (as well as other
components) for a software agent as will be described.
Software agents are deployable within the network to
perform or to facilitate a particular administration,
configuration or other management task specified by an administrator
or other system entity. Preferably, the software agent is a
piece of code executed by the runtime engine located at a
receiving node. Alternatively, the software agent runs as a
standalone application using local resources.
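
By way of illustration only, a runtime engine that loads and executes software agents might be sketched in Java as follows. The interface SoftwareAgent and the class RuntimeEngineSketch are hypothetical names. The sketch assumes agents are registered through in-process factories, which keeps the example self-contained; it does not show agents arriving over the network.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical contract for a piece of code the engine can execute.
interface SoftwareAgent {
    String execute();
}

// Illustrative runtime engine located at a receiving node.
final class RuntimeEngineSketch {
    private final Map<String, Supplier<SoftwareAgent>> loaded = new HashMap<>();

    // Make an agent available to the engine under a name.
    void load(String agentName, Supplier<SoftwareAgent> factory) {
        loaded.put(agentName, factory);
    }

    // Create a fresh agent instance and execute it, returning its result.
    String run(String agentName) {
        Supplier<SoftwareAgent> factory = loaded.get(agentName);
        if (factory == null) {
            throw new IllegalStateException("agent not loaded: " + agentName);
        }
        return factory.get().execute();
    }
}
```

An engine that receives compiled class files from a dispatch mechanism would more likely load them through a Java ClassLoader; the factory map here merely stands in for that step.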

In a representative embodiment, both the runtime engine
and the software agent(s) are written in Java. As is known in
the art, Java is an object-oriented, multi-threaded, portable,
platform-independent, secure programming environment
used to develop, test and maintain software programs. Java
programs have found extensive use on the World Wide Web,
which is the Internet's multimedia information retrieval
system. These programs include full-featured interactive,
standalone applications, as well as smaller programs, known
as applets, that run in a Java-enabled Web browser.

In one particular embodiment, a software agent is a Java 
applet (e.g., comprised of a set of Java "class" files) and the 
runtime environment includes a Java Virtual Machine 
(JVM) associated with a Web browser. In this illustrative
example, various nodes of the network are part of the
Internet, an intranet, or some other computer network or 
portion thereof. 
When the administrator configures a task for deployment,
the dispatch mechanism compiles the appropriate Java class
files (preferably based on the task or some characteristic
thereof) and dispatches the applet (as the software agent) in
the network. An applet is then executed on the JVM located
at a receiving node.

The runtime environments located across a set of given
managed machines collectively comprise a management
infrastructure deployed throughout the computer network.
FIG. 5 is a block diagram of a preferred runtime
environment. The runtime environment is a platform-level service
that can load and execute software agents. The environment
41 includes the runtime engine 42, a task manager 44, a
loader 46, a timer service 48, and a distributed monitor 50.
The distributed monitor (DM) 50 comprises
modules 52a-f run by the runtime engine 42 and that allow
the environment to perform monitoring activities. The
particular monitoring activities performed, of course, depend on
the resources being managed, but typically such resources
include storage devices and subsystems, printers, given
programs and/or tasks, and any other managed resource.
Generally, any such system, subsystem, device, resource,
program or the like may be deemed a "managed object." If
desired, the runtime environment components may be used
for activities other than monitoring (e.g., remote task
execution).

A representative monitoring operation involves making a
measurement, comparing the measured value against
threshold(s), and performing a response for out-of-tolerance
conditions. A monitoring agent is internally organized as a
program, with measurement, threshold comparison and
response elements. To the extent possible, a monitoring
agent must attempt to correct the detected condition when a
threshold has been exceeded.

The DM loader 52a controls the other DM modules. The
event correlator 52b implements event correlation. There are
a number of operations that many monitors will have in
common. A set of these are implemented as monitoring
intrinsics (tasks) 52c, and these tasks are available to all
monitor agents. They are preferably implemented in Java,
thus if the distributed monitor invokes an intrinsic task the
DM will already have the Java implementation for it.
Moreover, monitor agents may take local action, remote
action, or send an event in response to an out-of-tolerance
condition, with local action strongly preferred. A monitoring
agent is preferably defined with the response as an integral
part. Because an agent may contain logic, it can make the
desired measurements and then respond to the measured
value appropriately (e.g., send event, send e-mail, run task,
etc.). Available responses are not a fixed set, but rather
another task in the agent. A set of response tasks are thus also
provided as part of the monitoring intrinsics.

Routing is provided by the event router 52d as will be
described in more detail below. Pluggable event modules
(PEMs) 52e are used to integrate new event
sources/destinations with the other modules. A PEM is a task that
may represent an event source, an event destination or both,
and it is started when the distributed monitor 50 starts. The
distributed monitor 50 may be optionally configured to
perform basic HTTP server duties (e.g., servicing HTTP
GET requests, where the URL of the GET may be a DM
topology request, or a request for a status of a particular
DM). The HTTP interface 52f is responsible for turning the
requested data into HTML and returning it to the calling
browser.

The runtime environment may be configured to load and
run software agents at a startup time. When the environment
has been configured to act as a distributed monitor, some or
all of the modules 52 are loaded (by the DM loader 52a)
depending on DM configuration data. Each module 52 then
preferably configures itself at startup time and then begins
its appointed task. After the environment has completed
initialization, it periodically scans its timer list for scheduled
work and waits for commands.

The local runtime environment preferably supports at
least two (2) types of software agents: service and
monitoring. A service agent is initialized at startup by the DM
loader and it must implement the Java Service
interface. A service agent is responsible for extending the base
functionality of the environment. A monitoring agent is
responsible for making a measurement, comparing the
measured value against threshold(s), and performing a response
for out-of-tolerance conditions.

Agents may be parametric, with the parameters binding to
values when the agent is stored in the profile, when the agent
is run on the endpoint, or both. Agent parameters can be used
to specify threshold values (>80%), response task arguments
([illegible] 3) or any other values ([illegible] last 25
values for graphing).

Service agents are started by the DM
loader 52a at startup time. Each DM can be individually
configured to use one or more service agents. Service agents
are identified in the dispatch mechanism repository and are
propagated to the DMs across the managed environment
when the agent profile is pushed even though the service
agent does not appear in the profile. The repository tracks
which service agents are configured on which DMs, and it
will distribute the configuration information to the DM
loader when the agent profile is pushed.

Monitoring agents are defined in the repository and
configured to run on different DMs via agent profiles. As
discussed above, monitoring agents are organized as a
program, with measurement, threshold comparison and
response elements. The runtime environment preferably
provides a list of currently configured monitoring agents,
including the following information:

Name of Agent
Name of [illegible] Collection
Agent Index (unique ID for AE)
Status (WAIT, RUN, MEASURE, COMPARE, RESPOND)
State (DISABLED, Severity (NORMAL, WARNING, SEVERE, CRITICAL))
Last Value (string format)
Next Run Time ([DD Days] [HH Hours] MM:SS)

FIG. 6 is a block diagram illustrating how the distributed
monitor (DM) components interact with the runtime engine
42 to execute or control a software agent 55 configured to
perform monitoring. The software agent or task 55 may have
been deployed from the dispatch mechanism as previously
described. Inside the DM, the software agent 55 may be
triggered to run via the timer service 48 or due to a control
issued from another monitoring agent (e.g., one agent
calling another). Outside the DM, the software agent may be
triggered by an event via a PEM 52e, from input queue 57,
or from a command issued from command processor 56.

Thus, a distributed monitor (DM) within a given local
runtime environment uses "events" to convey status
change(s) in monitored object(s). Events are correlated, as will be
seen, using an event correlator comprising a correlation
engine 65 and a set of correlation rules 67.

FIG. 7 illustrates the operation of the event routing
module 52b. In many cases, the event source may not be able
to deliver the event directly to the destination and must send
the event to an intermediate location. A given monitor may
be a valid intermediate, and thus the monitor should also be
able to send and receive events and to logically route them
(e.g., to locations internal and external to the monitor).
These functions are carried out by the event routing module.
The module includes the input queue 57, a registration list
59, and an output queue 61. When the distributed monitor
receives an event, it first checks the registration list 59 for a
match. If any internal module (in the DM 50) has expressed
interest in the event class, the event is sent to a correlation
engine 65 and (possibly) removed from the input queue 57.
If the class is not in the registration list, the event is moved
to the output queue 61, where it interacts with other routing
data. As will be described below, each software agent can
register a correlation rule for a given event which will cause
the software agent to run when the event is received. The
correlation rule can instruct the correlation engine 65 to
consume or simply copy the event.

While processing, given software agents may generate
events (as can the distributed monitor itself), and those events
may be placed on the output queue 61. The output queue is
processed against a routing list and, as a result, the event
may be sent to a destination external to the distributed
monitor (or logged/discarded). Once the event is placed in
the output queue 61, it preferably cannot be routed back to
an internal DM module. The output queue 61 is responsible
for efficient and reliable delivery of events. Event classes
tagged for reliable delivery will be queued until delivered,
which includes the cases where the distributed monitor is
terminated or where the destination is not available for an

Referring back to FIG. 7, the input event queue 57 thus
processes each event (oldest event first) against each entry in
the registration list. If no registration entries match the
event, it is moved to the output queue. If the event does
match a registration entry it can be moved or copied into the
correlation engine. A correlation match may trigger an
automation to run or continue. The output event queue 61 is
configured to transmit events from the DM to other
locations. In particular, output queue 61 takes each event and
compares its class against the class qualifier in each route
entry for a match. A match is detected when the event class
hashes to the same value as one of the route list entries.
If a match is detected, the event is [illegible].
If the matched route list entry has a disposition of
Continue, then the other route list entries are checked as well;
otherwise the event is finished.

FIG. 8 illustrates the operation of the event correlator in
more detail. As noted above, the event correlator
comprises a correlation engine 65 and a set of correlation rules
67. Correlation rules 67 are components of or adjuncts to a
given software agent and [illegible] a context in which to
[illegible] or to correlate system events. Preferably, the
correlation rules 67 are configured at build time for the purpose
of examining a set of events for some observable
condition. Thus, a given correlation rule 67n identifies an
abstract [illegible] of which the events it addresses are
symptoms. It relates events to a more generic
problem.

A correlation rule 67 may be implemented as a simple
software-based "state machine" and thus the set of
correlation rules is sometimes referred to herein as a set of

extended period I of time. The total amount of output queuing efficien ay.coupled state machines for use in correlating 

space per distributed monitor preferably is configured on a events ^ ^ be ^ below> because each particular state 

per-DM b asis When tins space is exhausted, the oldest machine has a rel&Uvel si k low leyel fme&m ^ even , 

events preferably are purged to make room for newer ones, 3J correlation is much faster than is accomplished with more 

an da DM event will be generated. , , u high level correlation methods (e.g., an inference engine). In 

Each DM preferably receives routing data for the output- ^ fened embodiment of the present invention, there are 

queue 61 when an software agent profile push (e.g., from the fiv(J (5) basi{ . rf ^ machines or correlation rules 67: 

dispatch mechanism) is received. If the event topology is . . , 

very dynamic, a software agent may be used to force the DM M Matcbm S Rule * are m0S J f m P le c ? mmon U A 

to reload its routing data at some fixed interval. When the 40 matching rule is triggered by an event that matches the 

routing data is reloaded, all queued events are reprocessed * arch cntem - matcma S mlc has a ^ 

and delivery is attempted again. degenerate state. 

Each distributed monitor thus preferably contains a table Duplicate Rules are designed to reduce the event flow 

of routing information that is used to forward events up the 4S traffic. Once a duplicate rule is triggered, it ignores 

network topology (or deliver them to the final destination). subsequent events of the same type for a specified 

This routing table is available on-demand from an event period of time. 

routing service or some other source, and is stored (in PassThrough Rules are more complex matching rules that 

memory and in a start-up file) by the DM. The elements of are triggered by a specific sequence of events. This 

a routing entry are: $Q sequence can be in either specific or random order. 

Event Class — A regexp tested against the events class Reset Rules are opposites of PassThrough rules. They are 

name. A match causes this route entry to be used triggered only if the specific sequence of events does 

Destination — The name of the destination for the matched not occur; and 

event class. For some destination types this may be Threshold Rules look for a specific number of the same 

empty. 55 type of event Once this limit is reached, the rule is 

Destination Type — Class of destination. Choices are: DM, triggered 

EventConsole, EIF, LOG, DISCARD, MLM, IND. Thus, a given software agent may have associated there- 
Disposition — What to do with the event after the route with a set of state machines each of which is responsive to 
entry has been completed, or what to do if the route or that recognizes a given "pattern" of one or more events, 
destination is not available. Choices are: C, T go In this simplest case (namely, a matching rule), the pattern 
(Continue, Terminate) involves just a single rule. Other rules have more complex 
The route table for any given DM will be computed based patterns associated therewith. The set of state machines 
on the event topology data which is available at the man- define a palette of event patterns within the correlator. A 
aging server and each LCF gateway. Any distributed monitor given correlator may be limited with a similar set of these 
can send a request for a new route table, which will be 65 rules, with multiple versions of a given rule, or with just a 
computed based on the latest topology information avail- single rule. Of course, the rule semantics described above 
able. are merely illustrative, as many other types of rules may be 
devised and readily implemented in a given correlator associated with a software agent. The advantage of this approach is that a relatively small set of rules can be established and then used for optimum correlation (with respect to these particular rules).

Returning now back to FIG. 8, within the context of a software agent, the correlation rules comprising the set of state machines preferably act either as "triggers" or "components". If a correlation rule is a trigger, it sits at the beginning of a chain of software agents. At runtime, the state machine defining the correlation rule is activated immediately. Then, once the correlation criteria (namely, the pattern) of the rule are met, the state machine fires an event to start the software agent running. Conversely, the state machine defining a correlation rule may be an embedded component of the software agent. In this case, the state machine sits somewhere along a chain of software agents and remains deactivated until it receives the flow of control.

From an implementation perspective, a correlation rule may be implemented as a Java "Bean" wrapped around a specifically configured rule object. This rule object contains a processEvent() method that does all of the real event processing work.

The correlation engine 65 preferably sits within a chain of services running within a runtime engine 42 in the local runtime environment. In addition to routing events along this service chain, its main duty is to identify events specifically addressed by correlation rules 67 and to forward them accordingly. In order to keep track of currently active correlation rules, the correlation engine 65 uses the registration list 59 (of FIG. 7) to group rules according to the types of events they address. As noted above, a correlation rule is preferably a Java Bean wrapped around a rule object; registration simply means the correlation rule passes a reference of its rule object to the correlation engine, which groups it according to event type.

The correlation engine further categorizes its registered rules as either active or in-active. This distinction is important when considering the role a correlation rule plays within or in association with a software agent. If a correlation rule acts as a trigger within the software agent, the rule activates itself at registration time and starts processing events immediately. If a correlation rule is an embedded component of a software agent, it remains de-activated at registration time. Then, only when control flows to the rule does it activate itself. In both cases, the correlation rule remains active long enough for either its correlation criteria to be met or a specified time to occur.

The correlation engine is always routing events, even if there are no correlation rules present, along the service chain. Thus, the correlation engine 65 is a constantly running service that collects events from the input queue 57 into its own internal queue (not shown). A separate thread continuously grabs an event from this queue, if available, and processes it within a given context. In the absence of any correlation rules (the simplest case), this "event processing" merely entails forwarding the event to the output queue 61. Thus, in this context, the correlation engine acts as a simple event router.

When a software agent containing a correlation rule is loaded into the local runtime environment, the correlation rule registers itself with the correlation engine and, depending on its role within the software agent, immediately identifies itself as either active or de-active. As the correlation engine processes events, it checks whether the current event is addressed by any registered, active correlation rules. If there is a match, the correlation engine sends the event to the rule's processEvent() method.

The many ways the correlation engine can route an event are now described. In most cases, the flow of events along the service chain will remain unmodified. If there are active correlation rules present, they receive copies of appropriate events, which are also forwarded to the output queue 61. The resultant sequence of events is the same as the initial sequence.

There are two cases, however, which may modify the event stream. First, a correlation rule may "steal" an event from the stream. In this case, when the correlation engine sends the event to one or more of its registered rules, that event is lost to the service chain event stream. A correlation rule has a "consume" attribute that determines this behavior. If the attribute is set, then the rule consumes the event; if not, then the correlation rule receives a copy of the event (which is also forwarded to the output queue 61). In the second case, a software agent can re-insert an event into the event stream. There are several scenarios in which an agent can return an internally modified event to the input queue 57, but such an operation is not without risk. In particular, if the re-inserted event is again processed by the correlation engine, there is a possibility that the software agent will be caught in an event processing loop. To avoid such a situation, events can be flagged as "modified" when they are returned to the input queue. When such a flagged event is again examined by the correlation engine, it will pass the event to the output queue without sending it to a correlation rule. Once the event has left the correlation engine, its "modified" flag is cleared.

Once a correlation rule is finished processing events, e.g., either its criteria have been met or a specified timeout has occurred, it immediately de-activates itself. In the event that the correlation criteria have been satisfied, the correlation rule has the option of forwarding one, some, or all of the correlated events to whoever is listening. In most cases, this will be another agent component which aggregates the resultant correlation events into some useful format and forwards it to whoever is listening to the agent as a whole. In certain cases, however, such an agent component can take the events received from the correlation rule (which are themselves events plucked out of the service chain event stream), modify them in some way, and re-insert them in the input queue. In these cases, the re-inserted events must be marked as modified before they are returned to the event stream in order to avoid an event processing loop.

It should be appreciated that the present invention is not limited to monitoring any particular type of resource in the distributed computer network. In a network of computing machinery, a variety of systems, devices and/or components at different levels of conceptual complexity may be considered "resources". Thus, for example, some resources are actual hardware components of the network or the computing machinery, such as network routers or disk drives. Some resources are properties of the operating systems running on computers in the network, such as the set of active applications or open files. Still other resources are applications running on computers in the network. Finally, some aggregations of these types of resources may be considered high-level resources for management purposes. Examples of such high-level resources are: database management systems that include one or more server processes, client processes and their connections to the server, file systems (disk drives), and operating system resources, or distributed computation systems that include server applications on various computers, interconnections, file and memory resources, and communication subsystems.

Such distributed computing resources, in order to be used as critical components for the operation of an organization, must be continually monitored. The present invention thus illustrates how mobile software components (preferably Java-based) are deployed in the network and locally executed, preferably using a Java-based runtime environment, to perform a given monitoring task with respect to any such resource(s). The use of Java-based agents and runtime engines ensures that the components are easily portable across disparate system machines even as the network grows to include many thousands of connected, managed computers.
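
The correlation engine behavior described above (rules registered by event type, the "consume" attribute, and the "modified" flag that guards against processing loops) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the preferred embodiment: the class names Event, CorrelationRule, ThresholdRule and CorrelationEngine, their fields, and the route() method are hypothetical; only the processEvent() method name comes from the text. Timeouts and the separate engine thread are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical event: a class name plus the "modified" loop-guard flag. */
class Event {
    final String eventClass;
    boolean modified;            // set by an agent before re-inserting the event
    Event(String eventClass) { this.eventClass = eventClass; }
}

/** Hypothetical rule object; the text names only the processEvent() method. */
abstract class CorrelationRule {
    final String eventClass;     // event type this rule addresses
    final boolean consume;       // "consume" attribute: steal the event from the stream
    boolean active;              // triggers start active; embedded components do not
    boolean fired;               // true once the correlation criteria have been met
    CorrelationRule(String eventClass, boolean consume, boolean active) {
        this.eventClass = eventClass;
        this.consume = consume;
        this.active = active;
    }
    /** Returns true when the correlation criteria are met. */
    abstract boolean processEvent(Event e);
}

/** Threshold rule: triggered after a given number of events of the same type. */
class ThresholdRule extends CorrelationRule {
    private final int limit;
    private int count;
    ThresholdRule(String eventClass, int limit, boolean consume) {
        super(eventClass, consume, true);   // modeled here as a trigger (active at once)
        this.limit = limit;
    }
    @Override boolean processEvent(Event e) { return ++count >= limit; }
}

class CorrelationEngine {
    // Registration list: rules grouped by the event type they address.
    private final Map<String, List<CorrelationRule>> registrationList = new HashMap<>();
    final Queue<Event> outputQueue = new ArrayDeque<>();

    void register(CorrelationRule rule) {
        registrationList.computeIfAbsent(rule.eventClass, k -> new ArrayList<>()).add(rule);
    }

    void route(Event e) {
        if (e.modified) {                 // re-inserted event: bypass all rules
            e.modified = false;           // flag is cleared as the event leaves the engine
            outputQueue.add(e);
            return;
        }
        boolean consumed = false;
        List<CorrelationRule> rules = registrationList.get(e.eventClass);
        if (rules != null) {
            for (CorrelationRule rule : rules) {
                if (!rule.active) continue;
                if (rule.processEvent(e)) {
                    rule.fired = true;
                    rule.active = false;  // a finished rule de-activates itself
                }
                consumed |= rule.consume;
            }
        }
        if (!consumed) outputQueue.add(e);   // otherwise the event flows on unmodified
    }

    public static void main(String[] args) {
        CorrelationEngine engine = new CorrelationEngine();
        engine.register(new ThresholdRule("DiskFull", 3, false));  // copy, do not consume
        for (int i = 0; i < 3; i++) engine.route(new Event("DiskFull"));
        System.out.println("forwarded events: " + engine.outputQueue.size());
    }
}
```

With the consume attribute unset, every event still reaches the output queue, so the engine behaves as a simple router that also feeds copies to its rules; setting the attribute removes matched events from the stream until the rule de-activates.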

One of the preferred implementations of the runtime 
monitor is as a set of instructions in a code module resident 
in the random access memory of a computer. Until required 
by the computer, the set of instructions may be stored in 
another computer memory, for example, in a hard disk drive,
or in a removable memory such as an optical disk (for
eventual use in a CD ROM) or floppy disk (for eventual use 
in a floppy disk drive), or even downloaded via the Internet. 

In addition, although the various methods described are 
conveniently implemented in a general purpose computer 
selectively activated or reconfigured by software, one of
ordinary skill in the art would also recognize that such 
methods may be carried out in hardware, in firmware, or in 
more specialized apparatus constructed to perform the 
required method steps. 

Further, although the invention has been described in
terms of a preferred embodiment in a specific network 
environment, those skilled in the art will recognize that the 
invention can be practiced, with modification, in other and 
different network architectures within the spirit and scope of
the appended claims. Moreover, the inventive diagnostic
technique should be useful in any distributed network envi- 
ronment. 

The present invention provides numerous advantages 
over the prior art. The technique is readily scalable as the
distributed network increases in size. An administrator is
able to add new monitoring capabilities to an appropriate 
subset of the endpoints with minimal effort. The distributed 
monitoring application behaves predictably under peak 
conditions, and individual monitors may be readily adjusted
to consume a minimum portion of the available computing
resources (network, CPU, disk).
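
As an illustration of bounded resource use, the following sketch models the fixed-size output queue (oldest events purged when space is exhausted) and the route table matching described earlier. It is a simplified model under stated assumptions: the names RouteEntry, OutputQueue, enqueue() and resolve() are hypothetical, events are reduced to their class names, the regexp is applied as a full match, and a purge counter stands in for the DM event that is generated on a purge. The capacity is assumed to be positive.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;

/** One routing entry with the four elements listed in the description. */
class RouteEntry {
    final Pattern eventClass;      // regexp tested against the event's class name
    final String destination;      // may be empty for some destination types
    final String destinationType;  // e.g. "DM", "EventConsole", "LOG", "DISCARD"
    final boolean continueAfter;   // disposition: C (true) or T (false)
    RouteEntry(String regex, String destination, String destinationType, boolean continueAfter) {
        this.eventClass = Pattern.compile(regex);
        this.destination = destination;
        this.destinationType = destinationType;
        this.continueAfter = continueAfter;
    }
}

class OutputQueue {
    private final int capacity;                                 // assumed > 0, set per DM
    private final Deque<String> pending = new ArrayDeque<>();   // event class names, oldest first
    private final List<RouteEntry> routeTable = new ArrayList<>();
    int purged;   // stands in for the DM event generated when events are purged

    OutputQueue(int capacity) { this.capacity = capacity; }

    void addRoute(RouteEntry entry) { routeTable.add(entry); }

    /** Queues an event, purging the oldest one if the space is exhausted. */
    void enqueue(String eventClassName) {
        if (pending.size() == capacity) {
            pending.pollFirst();
            purged++;
        }
        pending.addLast(eventClassName);
    }

    /** Returns "type:destination" for every route entry the event is sent to. */
    List<String> resolve(String eventClassName) {
        List<String> targets = new ArrayList<>();
        for (RouteEntry entry : routeTable) {
            if (!entry.eventClass.matcher(eventClassName).matches()) continue;
            targets.add(entry.destinationType + ":" + entry.destination);
            if (!entry.continueAfter) break;   // Terminate: the event is finished
        }
        return targets;
    }

    int size() { return pending.size(); }
    String oldest() { return pending.peekFirst(); }
}
```

A caller would add route entries in priority order, since a matching entry with a Terminate disposition stops further entries from being checked.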

The invention also provides more built-in local response 
capability. It is insufficient to note that a monitored value is 
out of tolerance. Whenever possible, a local attempt to 
correct the situation must be made. Some errors have no
local remedy, such as a client application which is dependent 
on a single server. In these cases, the response must have a 
corresponding remote action which can be triggered by the 
client error. The invention also provides more efficient 
monitor execution, even though the monitor is executing on
systems which have less capacity than previous endpoints.
The distributed monitor is written to be as efficient as possible in
terms of CPU and memory requirements. Further, the monitor
is easy to integrate with other event systems. The
distributed monitor thus provides a good bridge between
system management and network management, allowing 
network-specific events to be used in application problem 
diagnosis. 
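
The local response behavior described above (take a measurement, compare it against a threshold, attempt a local correction, and otherwise emit an event so that a remote action can be triggered) might look roughly like the sketch below. All names here are hypothetical, including the ThresholdMonitor class and the "CORRECTED" and "THRESHOLD_EXCEEDED" event names; re-measuring after the local fix is an added assumption, and a plain list stands in for the output queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.DoubleSupplier;

class ThresholdMonitor {
    private final DoubleSupplier measurement;   // takes the measurement
    private final double threshold;
    private final BooleanSupplier localFix;     // returns true if a local correction was applied
    final List<String> emittedEvents = new ArrayList<>();   // stands in for the output queue

    ThresholdMonitor(DoubleSupplier measurement, double threshold, BooleanSupplier localFix) {
        this.measurement = measurement;
        this.threshold = threshold;
        this.localFix = localFix;
    }

    /** One monitoring pass; returns true if the value was within tolerance. */
    boolean run() {
        double value = measurement.getAsDouble();
        if (value <= threshold) {
            return true;
        }
        // Out of tolerance: try the local remedy first, then re-measure (assumption).
        if (localFix.getAsBoolean() && measurement.getAsDouble() <= threshold) {
            emittedEvents.add("CORRECTED");
        } else {
            // No local remedy: emit an event so a remote action can be triggered.
            emittedEvents.add("THRESHOLD_EXCEEDED");
        }
        return false;
    }

    public static void main(String[] args) {
        double[] diskUsage = {95.0};   // hypothetical percent-used reading
        ThresholdMonitor monitor = new ThresholdMonitor(
                () -> diskUsage[0], 90.0, () -> { diskUsage[0] = 50.0; return true; });
        monitor.run();
        System.out.println(monitor.emittedEvents);
    }
}
```

Such a pass could be started by an internal timer or by a correlation rule acting as a trigger, in line with the claims that follow.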

Having thus described our invention, what we claim as 
new and desire to secure by letters patent is set forth in the
following claims. 

What is claimed is: 

1. A method of monitoring in a distributed computer 
network having a set of managed computers, wherein 
instances of a runtime engine are deployed across a given
subset of the managed computers, the method comprising 
the steps of:

at a given managed computer, establishing an event class
registration list; 

upon receipt of an event having an event class associated 
therewith, examining the registration list to determine 
whether a given monitoring task has expressed interest 
in the event class; 

processing the event through a correlator if the monitoring 
task has expressed interest in the event class; and 

taking a given action if a condition expressed in a corre- 
lation rule associated with the monitoring task has been 
met. 

2. The method as described in claim 1 wherein the given 
action initiates the given monitoring task. 

3. The method as described in claim 2 wherein the given 
monitoring task is executed in the runtime engine to deter- 
mine whether a given threshold has been exceeded. 

4. The method as described in claim 3 wherein the 
monitoring task takes a measurement and compares the 
measurement against the given threshold. 

5. The method as described in claim 4 further including 
the step of having the monitoring task attempt to correct a 
detected condition when the given threshold has been 
exceeded. 

6. The method as described in claim 1 wherein the step of 
processing the event further includes consuming the event. 

7. The method as described in claim 1 wherein the step of 
processing the event includes delivering the event to an 
output queue for subsequent delivery to a target location. 

8. An apparatus for monitoring in a distributed computer 
network having a set of managed computers, wherein 
instances of a runtime engine are deployed across a given 
subset of the managed computers, comprising: 

list means, at a given managed computer, for establishing 
an event class registration list; 

examination means, upon receipt of an event having an 
event class associated therewith, for examining the 
registration list to determine whether a given monitor- 
ing task has expressed interest in the event class; 

processing means for processing the event through a 
correlator if the monitoring task has expressed interest 
in the event class; and 

means for taking a given action if a condition expressed 
in a correlation rule associated with the monitoring task 
has been met. 

9. The apparatus as described in claim 8 wherein the given 
action initiates the given monitoring task. 

10. The apparatus as described in claim 9 wherein the 
given monitoring task is executed in the runtime engine to 
determine whether a given threshold has been exceeded. 

11. The apparatus as described in claim 10 wherein the 
monitoring task takes a measurement and compares the 
measurement against the given threshold. 

12. The apparatus as described in claim 11 wherein the 
monitoring task attempts to correct a detected condition 
when the given threshold has been exceeded. 

13. The apparatus as described in claim 8 wherein the 
processing means further includes means for consuming the
event. 

14. The apparatus as described in claim 8 wherein the 
processing means includes means for delivering the event to 
an output queue for subsequent delivery to a target location. 

15. A method of monitoring a resource in a distributed 
computer network having a management server servicing a 
set of managed computers, comprising the steps of: 

deploying instances of a runtime engine across a subset of 
the managed computers;

deploying a software agent into the computer network;

at a given managed computer, upon receipt of an event 
that satisfies a condition established by a correlation 
rule associated with the software agent, executing the 
software agent using the runtime engine to determine 
whether a given threshold has been exceeded; and 

taking a given action if the given threshold has been 
exceeded. 

16. The method as described in claim 15 wherein the step
of executing the software agent includes taking a measure- 
ment and comparing the measurement against the given 
threshold. 

17. The method as described in claim 15 wherein the 
given action comprises attempting to correct a detected 
condition when the given threshold has been exceeded. 

18. The method as described in claim 15 wherein the 
given action comprises outputting an event when the given 
threshold has been exceeded. 

19. The method as described in claim 15 wherein the 
runtime engine includes a Java virtual machine and the 
software agent is a Java applet. 

20. A computer program product, in a computer readable 
medium, for monitoring a resource in a distributed computer 
network having a management server servicing a set of 
managed computers, comprising: 

instructions for deploying instances of a runtime engine 
across a subset of the managed computers; 

instructions for deploying a software agent into the com- 
puter network; 

instructions, at a given managed computer, upon receipt 
of an event that satisfies a condition established by a 
correlation rule associated with the software agent, for 
executing the software agent using the runtime engine 
to determine whether a given threshold has been 
exceeded; and 

instructions for taking a given action if the given thresh- 
old has been exceeded. 

21. A computer program product, in a computer readable
medium, for monitoring in a distributed computer network
having a set of managed computers, wherein instances of a
runtime engine are deployed across a given subset of the
managed computers, comprising:

instructions, at a given managed computer, for establishing
an event class registration list;

instructions, upon receipt of an event having an event
class associated therewith, for examining the registration
list to determine whether a given monitoring task
has expressed interest in the event class;

instructions for processing the event through a correlator
if the monitoring task has expressed interest in the
event class; and

instructions for taking a given action if a condition
expressed in a correlation rule associated with the
monitoring task has been met.

22. An apparatus for monitoring a resource in a distributed
computer network having a management server servicing a
set of managed computers, comprising:

first deployment means for deploying instances of a
runtime engine across a subset of the managed computers;

second deployment means for deploying a software agent
into the computer network;

execution means, at a given managed computer, upon
receipt of an event that satisfies a condition established
by a correlation rule associated with the software agent,
for executing the software agent using the runtime
engine to determine whether a given threshold has been
exceeded; and

means for taking a given action if the given threshold has
been exceeded.

23. The apparatus as described in claim 22 wherein the
execution means includes means for taking a measurement
and comparing the measurement against the given threshold.

24. The apparatus as described in claim 22 wherein the
given action comprises attempting to correct a detected
condition when the given threshold has been exceeded.

25. The apparatus as described in claim 22 wherein the
given action comprises outputting an event when the given
threshold has been exceeded.

26. The apparatus as described in claim 22 wherein the
runtime engine includes a Java virtual machine and the
software agent is a Java applet.